Eliminating Noisy Information in Web Pages using featured DOM tree

Authors

Shine N. Das

Pramod K. Vijayaraghavan

Midhun Mathew

Abstract

Exact information retrieval from the Web is now a great challenge that pushes researchers to devise new methodologies for web mining. The volume of information on the Web and the number of web pages are growing rapidly, and this content is often accompanied by a large amount of noise such as banner advertisements, navigation bars, and copyright notices. Although such items are functionally useful for human viewers and necessary for web site owners, they often hamper automated information gathering and web data mining. The presence of such noisy information degrades the efficiency of feature extraction and, ultimately, classification accuracy. Cleaning web pages before mining therefore becomes critical for improving the mining results. In our work, we focus on identifying and removing local noise in web pages to improve mining performance. We propose a novel and simple idea for the detection and removal of local noise using a new tree structure called the featured DOM tree. A three-stage algorithm is proposed in which feature selection is done in the first phase, a featured DOM tree is created in the second phase, and noise is marked and pruned in the third phase. The experimental results show that our algorithm performs well on various benchmark measures and that the F-score and accuracy of automatic web page classification increase as a result.

General Terms: Web content mining, Web page classification.
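The three phases named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the abstract does not say which features are selected or how noise is decided, so the sketch assumes two hand-picked features (text length and link-text length per subtree) and a link-density threshold as a stand-in noise rule. All names (`Node`, `TreeBuilder`, `annotate`, `mark_and_prune`, `clean_page`, `max_link_density`) are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag; we must not descend into them.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """One DOM element plus the features attached to it."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""       # text directly inside this element
        self.features = {}   # filled in by annotate()
        self.noisy = False   # set by mark_and_prune()


class TreeBuilder(HTMLParser):
    """Builds a simple DOM tree using only the standard library."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        piece = data.strip()
        if piece:
            self.current.text = (self.current.text + " " + piece).strip()


def annotate(node, inside_link=False):
    """Phase 2 (assumed form): attach subtree features to every node."""
    inside_link = inside_link or node.tag == "a"
    text_len = len(node.text)
    link_len = text_len if inside_link else 0
    for child in node.children:
        annotate(child, inside_link)
        text_len += child.features["text_len"]
        link_len += child.features["link_len"]
    node.features = {"text_len": text_len, "link_len": link_len}


def mark_and_prune(node, max_link_density=0.5):
    """Phase 3 (assumed rule): mark link-heavy subtrees as noise, drop them."""
    kept = []
    for child in node.children:
        f = child.features
        density = f["link_len"] / f["text_len"] if f["text_len"] else 0.0
        # A bare <a> is not noise by itself; only link-heavy containers are.
        child.noisy = child.tag != "a" and density > max_link_density
        if not child.noisy:
            mark_and_prune(child, max_link_density)
            kept.append(child)
    node.children = kept


def visible_text(node):
    """Collect the remaining text in document order."""
    parts = [node.text] if node.text else []
    for child in node.children:
        parts.extend(visible_text(child))
    return parts


def clean_page(html, max_link_density=0.5):
    # Phase 1 (feature selection) is fixed by hand here: text_len and link_len.
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    annotate(builder.root)
    mark_and_prune(builder.root, max_link_density)
    return " ".join(visible_text(builder.root))
```

Calling `clean_page` on a page whose navigation bar is a `div` containing only links removes that `div` and keeps paragraphs that merely contain a link. The threshold of 0.5 is an arbitrary illustrative value, not one taken from the paper.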


Similar resources

Web Page Performance Enhancement by Removing Noise

Data mining is the procedure of extracting information from huge sets of data. Web mining is an important application of data mining, which extracts knowledge from Web data including Web documents, hyperlinks, usage logs of web sites, etc. A web page contains many blocks such as content blocks, copyrights, privacy notes and advertisements. These blocks like advertiseme...


Eliminating the Noise from Web Pages using Page Replacement Algorithm

Data mining is the process of mining information from large sets of data. It has many categories, such as text mining, web usage mining, and web content mining. Several types of algorithm are used in web mining, e.g. the Visitor method, the DOM tree, and the Least Recently Used algorithm. The Visitor and DOM tree methods are complex and time consuming. The Least Recently Used algorithm is less time c...


Data Extraction using Content-Based Handles

In this paper, we present an approach and a visual tool, called HWrap (Handle Based Wrapper), for creating web wrappers to extract data records from web pages. In our approach, we mainly rely on the visible page content to identify data regions on a web page. Our extraction algorithm is inspired by the way a human user scans the page content for specific data. In particular, we use text fea...


Analyzing new features of infected web content in detection of malicious web pages

Recent improvements in web standards and technologies enable attackers to hide and obfuscate malicious code with new methods and thus escape security filters. In this paper, we study the application of machine learning techniques in detecting malicious web pages. In order to detect malicious web pages, we propose and analyze a novel set of features including HTML, JavaScript (jQuery...


Retrieve Information Using Improved Document Object Model Parser Tree Algorithm

Data mining refers to mining useful information from raw or unstructured data. In web content mining, the data is scattered or unstructured across web pages. Sometimes the user wants to retrieve only a fixed kind of data, but unwanted data is also retrieved. The unnecessary information can be removed with this proposed work. The DOM Parser Tree Algorithm to filter the web pages ...



Publication date: 2012
